pacman::p_load(sf, spdep, tmap, tidyverse, funModeling, shinyjs, cluster, factoextra, heatmaply, ClustGeo, GGally, ggpubr, corrplot, blorr, GWmodel, skimr, caret, stats)
# install.packages("caret", dependencies = TRUE)
library("caret")Inclass5
Overview
Predict the functionality of a water point based on a set of variables. We will build a prediction model using global generalized regression model and geographically weighted generalised regression model and compare the performance of these models using a set of evaluation metrics.
Independent Variables:
There are 2 data sets. One contains Local Government Boundary data and the other contains the water point location.
Independent Variables:
Distance to primary road
Distance to secondary road
Distance to tertiary road
Distance to city
Distance to town
Water Point population
Local Population 1KM
Usage capacity
Urban
Water Source Clean
Dependent Variable:
There are only 2 outcomes for the status water point. It is either functional or non-functional. It is a binary variable. Since this violates the linearity assumptions, linear regression cannot be used.
Benefits of Logistic Regression:
Logistic regression allows us to relax a lot of assumptions.
Logistic Regression does not assume a linear relationship between dependent and independent variables
The independent variables need not be interval nor normally distributed nor linearly related, nor equal variance within each group
Limitations of Logistic Regression:
Large Sample of data is needed
- This is because Logistic regression uses maximum likelihood algorithm which maximises the probability and this requires more data
1.Data Preparation
We load the packages needed and install them if needed. Packages needed are:
- sf, spdep, tmap, tidyverse, funModeling, shinyjs, cluster, factoextra, heatmaply, ClustGeo, GGally
First, we use pacman to load in the library we need.
Next, we are going to import the data
Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")We take a look at the column data type and its first few instances.
glimpse(Osun)Rows: 30
Columns: 5
$ ADM2_EN <chr> "Aiyedade", "Aiyedire", "Atakumosa East", "Atakumosa West",…
$ ADM2_PCODE <chr> "NG030001", "NG030002", "NG030003", "NG030004", "NG030005",…
$ ADM1_EN <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ ADM1_PCODE <chr> "NG030", "NG030", "NG030", "NG030", "NG030", "NG030", "NG03…
$ geometry <MULTIPOLYGON [m]> MULTIPOLYGON (((213526.6 34..., MULTIPOLYGON (…
glimpse(Osun_wp_sf)Rows: 4,760
Columns: 75
$ row_id <dbl> 70578, 66401, 65607, 68989, 67708, 66419, 6…
$ source <chr> "Federal Ministry of Water Resources, Niger…
$ lat_deg <dbl> 7.759448, 8.031187, 7.891137, 7.509588, 7.4…
$ lon_deg <dbl> 4.563998, 4.637400, 4.713438, 4.265002, 4.3…
$ report_date <chr> "05/11/2015 12:00:00 AM", "04/30/2015 12:00…
$ status_id <chr> "No", "No", "No", "No", "Yes", "Yes", "Yes"…
$ water_source_clean <chr> "Borehole", "Borehole", "Borehole", "Boreho…
$ water_source_category <chr> "Well", "Well", "Well", "Well", "Well", "We…
$ water_tech_clean <chr> "Mechanized Pump", "Mechanized Pump", "Mech…
$ water_tech_category <chr> "Mechanized Pump", "Mechanized Pump", "Mech…
$ facility_type <chr> "Improved", "Improved", "Improved", "Improv…
$ clean_country_name <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria",…
$ clean_adm1 <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ clean_adm2 <chr> "Osogbo", "Odo-Otin", "Boripe", "Ayedire", …
$ clean_adm3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ clean_adm4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ install_year <dbl> NA, 2004, 2006, 2014, 2011, 2007, 2007, 200…
$ installer <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rehab_year <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ rehabilitator <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ management_clean <chr> NA, NA, NA, NA, NA, "Community Management",…
$ status_clean <chr> "Abandoned/Decommissioned", "Abandoned/Deco…
$ pay <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ fecal_coliform_presence <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fecal_coliform_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ subjective_quality <chr> "Acceptable quality", "Acceptable quality",…
$ activity_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ scheme_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wpdx_id <chr> "6FV6QH57+QHH", "6FW62JJP+FXC", "6FV6VPR7+F…
$ notes <chr> "COSTAIN area, Osogbo", "Igbaye", "Araromi…
$ orig_lnk <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ photo_lnk <chr> "https://akvoflow-55.s3.amazonaws.com/image…
$ country_id <chr> "NG", "NG", "NG", "NG", "NG", "NG", "NG", "…
$ data_lnk <chr> "https://catalog.waterpointdata.org/dataset…
$ distance_to_primary_road <dbl> 167.82235, 4133.71306, 6559.76182, 8260.715…
$ distance_to_secondary_road <dbl> 838.9185, 1162.6246, 4625.0324, 5854.5731, …
$ distance_to_tertiary_road <dbl> 1181.107236, 9.012647, 58.314987, 2013.6515…
$ distance_to_city <dbl> 2449.2931, 16704.1923, 21516.8495, 31765.68…
$ distance_to_town <dbl> 9463.295, 5176.899, 8589.103, 16386.459, 23…
$ water_point_history <chr> "{\"2015-05-11\": {\"photo_lnk\": \"https:/…
$ rehab_priority <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ water_point_population <dbl> NA, NA, NA, NA, 0, 508, 162, 362, 3562, 217…
$ local_population_1km <dbl> NA, NA, NA, NA, 70, 647, 449, 1611, 7199, 2…
$ crucialness_score <dbl> NA, NA, NA, NA, NA, 0.785162287, 0.36080178…
$ pressure_score <dbl> NA, NA, NA, NA, NA, 1.69333333, 0.54000000,…
$ usage_capacity <dbl> 1000, 1000, 1000, 300, 1000, 300, 300, 300,…
$ is_urban <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALS…
$ days_since_report <dbl> 2689, 2700, 2688, 2693, 2701, 2692, 2681, 2…
$ staleness_score <dbl> 42.84384, 42.69554, 42.85735, 42.78986, 42.…
$ latest_record <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ location_id <dbl> 239556, 230405, 240009, 236400, 229231, 237…
$ cluster_size <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
$ clean_country_id <chr> "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "…
$ country_name <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria",…
$ water_source <chr> "Improved Tube well or borehole", "Improved…
$ water_tech <chr> "Motorised", "Motorised", "Motorised", "yet…
$ status <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRU…
$ adm2 <chr> "Osogbo", "Odo-Otin", "Boripe", "Aiyedire",…
$ adm3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ management <chr> NA, NA, NA, NA, NA, "Community Management",…
$ adm1 <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ `New Georeferenced Column` <chr> "POINT (4.5639983 7.7594483)", "POINT (4.63…
$ lat_deg_original <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lat_lon_deg <chr> "(7.7594483°, 4.5639983°)", "(8.0311867°, 4…
$ lon_deg_original <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ public_data_source <chr> "https://catalog.waterpointdata.org/datafil…
$ converted <chr> "#status_id, #water_source, #pay, #status, …
$ count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ created_timestamp <chr> "06/30/2020 12:56:07 PM", "06/30/2020 12:56…
$ updated_timestamp <chr> "06/30/2020 12:56:07 PM", "06/30/2020 12:56…
$ Geometry <POINT [m]> POINT (236239.7 417577), POINT (24463…
$ ADM2_EN <chr> "Osogbo", "Odo-Otin", "Boripe", "Aiyedire",…
$ ADM2_PCODE <chr> "NG030030", "NG030025", "NG030006", "NG0300…
$ ADM1_EN <chr> "Osun", "Osun", "Osun", "Osun", "Osun", "Os…
$ ADM1_PCODE <chr> "NG030", "NG030", "NG030", "NG030", "NG030"…
Osun_wp_sf %>%
freq(input = 'status')
status frequency percentage cumulative_perc
1 TRUE 2642 55.5 55.5
2 FALSE 2118 44.5 100.0
There are 2 categories of status - true and false. 55.5% are true and 44.5% are false.
We use the skim() function to help us with our EDA. It shows us the summary statistics. Regression models are sensitive to missing values. We can identify some of the columns with many missing fields using this method. We will not be using columns where there are too many missing values. If there are just a few, we remove those rows of data point.
Osun_wp_sf %>%
skim()| Name | Piped data |
| Number of rows | 4760 |
| Number of columns | 75 |
| _______________________ | |
| Column type frequency: | |
| character | 47 |
| logical | 5 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1.00 | 5 | 44 | 0 | 2 | 0 |
| report_date | 0 | 1.00 | 22 | 22 | 0 | 42 | 0 |
| status_id | 0 | 1.00 | 2 | 7 | 0 | 3 | 0 |
| water_source_clean | 0 | 1.00 | 8 | 22 | 0 | 3 | 0 |
| water_source_category | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| water_tech_clean | 24 | 0.99 | 9 | 23 | 0 | 3 | 0 |
| water_tech_category | 24 | 0.99 | 9 | 15 | 0 | 2 | 0 |
| facility_type | 0 | 1.00 | 8 | 8 | 0 | 1 | 0 |
| clean_country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| clean_adm1 | 0 | 1.00 | 3 | 5 | 0 | 5 | 0 |
| clean_adm2 | 0 | 1.00 | 3 | 14 | 0 | 35 | 0 |
| clean_adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| clean_adm4 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| installer | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management_clean | 1573 | 0.67 | 5 | 37 | 0 | 7 | 0 |
| status_clean | 0 | 1.00 | 9 | 32 | 0 | 7 | 0 |
| pay | 0 | 1.00 | 2 | 39 | 0 | 7 | 0 |
| fecal_coliform_presence | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| subjective_quality | 0 | 1.00 | 18 | 20 | 0 | 4 | 0 |
| activity_id | 4757 | 0.00 | 36 | 36 | 0 | 3 | 0 |
| scheme_id | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| wpdx_id | 0 | 1.00 | 12 | 12 | 0 | 4760 | 0 |
| notes | 0 | 1.00 | 2 | 96 | 0 | 3502 | 0 |
| orig_lnk | 4757 | 0.00 | 84 | 84 | 0 | 1 | 0 |
| photo_lnk | 41 | 0.99 | 84 | 84 | 0 | 4719 | 0 |
| country_id | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| data_lnk | 0 | 1.00 | 79 | 96 | 0 | 2 | 0 |
| water_point_history | 0 | 1.00 | 142 | 834 | 0 | 4750 | 0 |
| clean_country_id | 0 | 1.00 | 3 | 3 | 0 | 1 | 0 |
| country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| water_source | 0 | 1.00 | 8 | 30 | 0 | 4 | 0 |
| water_tech | 0 | 1.00 | 5 | 37 | 0 | 20 | 0 |
| adm2 | 0 | 1.00 | 3 | 14 | 0 | 33 | 0 |
| adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management | 1573 | 0.67 | 5 | 47 | 0 | 7 | 0 |
| adm1 | 0 | 1.00 | 4 | 5 | 0 | 4 | 0 |
| New Georeferenced Column | 0 | 1.00 | 16 | 35 | 0 | 4760 | 0 |
| lat_lon_deg | 0 | 1.00 | 13 | 32 | 0 | 4760 | 0 |
| public_data_source | 0 | 1.00 | 84 | 102 | 0 | 2 | 0 |
| converted | 0 | 1.00 | 53 | 53 | 0 | 1 | 0 |
| created_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| updated_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| Geometry | 0 | 1.00 | 33 | 37 | 0 | 4760 | 0 |
| ADM2_EN | 0 | 1.00 | 3 | 14 | 0 | 30 | 0 |
| ADM2_PCODE | 0 | 1.00 | 8 | 8 | 0 | 30 | 0 |
| ADM1_EN | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| ADM1_PCODE | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| rehab_year | 4760 | 0 | NaN | : |
| rehabilitator | 4760 | 0 | NaN | : |
| is_urban | 0 | 1 | 0.39 | FAL: 2884, TRU: 1876 |
| latest_record | 0 | 1 | 1.00 | TRU: 4760 |
| status | 0 | 1 | 0.56 | TRU: 2642, FAL: 2118 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| row_id | 0 | 1.00 | 68550.48 | 10216.94 | 49601.00 | 66874.75 | 68244.50 | 69562.25 | 471319.00 | ▇▁▁▁▁ |
| lat_deg | 0 | 1.00 | 7.68 | 0.22 | 7.06 | 7.51 | 7.71 | 7.88 | 8.06 | ▁▂▇▇▇ |
| lon_deg | 0 | 1.00 | 4.54 | 0.21 | 4.08 | 4.36 | 4.56 | 4.71 | 5.06 | ▃▆▇▇▂ |
| install_year | 1144 | 0.76 | 2008.63 | 6.04 | 1917.00 | 2006.00 | 2010.00 | 2013.00 | 2015.00 | ▁▁▁▁▇ |
| fecal_coliform_value | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| distance_to_primary_road | 0 | 1.00 | 5021.53 | 5648.34 | 0.01 | 719.36 | 2972.78 | 7314.73 | 26909.86 | ▇▂▁▁▁ |
| distance_to_secondary_road | 0 | 1.00 | 3750.47 | 3938.63 | 0.15 | 460.90 | 2554.25 | 5791.94 | 19559.48 | ▇▃▁▁▁ |
| distance_to_tertiary_road | 0 | 1.00 | 1259.28 | 1680.04 | 0.02 | 121.25 | 521.77 | 1834.42 | 10966.27 | ▇▂▁▁▁ |
| distance_to_city | 0 | 1.00 | 16663.99 | 10960.82 | 53.05 | 7930.75 | 15030.41 | 24255.75 | 47934.34 | ▇▇▆▃▁ |
| distance_to_town | 0 | 1.00 | 16726.59 | 12452.65 | 30.00 | 6876.92 | 12204.53 | 27739.46 | 44020.64 | ▇▅▃▃▂ |
| rehab_priority | 2654 | 0.44 | 489.33 | 1658.81 | 0.00 | 7.00 | 91.50 | 376.25 | 29697.00 | ▇▁▁▁▁ |
| water_point_population | 4 | 1.00 | 513.58 | 1458.92 | 0.00 | 14.00 | 119.00 | 433.25 | 29697.00 | ▇▁▁▁▁ |
| local_population_1km | 4 | 1.00 | 2727.16 | 4189.46 | 0.00 | 176.00 | 1032.00 | 3717.00 | 36118.00 | ▇▁▁▁▁ |
| crucialness_score | 798 | 0.83 | 0.26 | 0.28 | 0.00 | 0.07 | 0.15 | 0.35 | 1.00 | ▇▃▁▁▁ |
| pressure_score | 798 | 0.83 | 1.46 | 4.16 | 0.00 | 0.12 | 0.41 | 1.24 | 93.69 | ▇▁▁▁▁ |
| usage_capacity | 0 | 1.00 | 560.74 | 338.46 | 300.00 | 300.00 | 300.00 | 1000.00 | 1000.00 | ▇▁▁▁▅ |
| days_since_report | 0 | 1.00 | 2692.69 | 41.92 | 1483.00 | 2688.00 | 2693.00 | 2700.00 | 4645.00 | ▁▇▁▁▁ |
| staleness_score | 0 | 1.00 | 42.80 | 0.58 | 23.13 | 42.70 | 42.79 | 42.86 | 62.66 | ▁▁▇▁▁ |
| location_id | 0 | 1.00 | 235865.49 | 6657.60 | 23741.00 | 230638.75 | 236199.50 | 240061.25 | 267454.00 | ▁▁▁▁▇ |
| cluster_size | 0 | 1.00 | 1.05 | 0.25 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | ▇▁▁▁▁ |
| lat_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| lon_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| count | 0 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▇▁▁ |
We remove missing values with below code chunk:
Osun_wp_sf_clean <- Osun_wp_sf %>%
filter_at(vars(status,
distance_to_primary_road,
distance_to_secondary_road,
distance_to_city,
distance_to_town,
water_point_population,
local_population_1km,
usage_capacity,
is_urban,
water_source_clean),
all_vars(!is.na(.))) %>%
mutate(usage_capacity = as.factor(usage_capacity))We convert usage capacity as a category by using as.factor. Data is now categorical instead of numerical.
1.1 Correlation Analysis
To do correlation analysis, it does not recognise data format with geometry. Hence we set it as NULL to remove it.
Osun_wp <- Osun_wp_sf_clean %>%
select(c(7,35:39, 42:43, 46:47, 57)) %>%
st_set_geometry(NULL)We take a subset of the Osun_wp from column 2 to 7.
cluster_vars.cor = cor(
Osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black")
None of them have high correlation. There will be no problem of multicollinearity.We can start building out model.
2. Building Global Regression Model
model <- glm(status ~ distance_to_primary_road +
distance_to_secondary_road +
distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sf_clean,
family = binomial(link = "logit"))We check the summary of the model using blr_regress.
blr_regress(model) Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data status 4756 4755 4744 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 2114 1 2642
--------------------------------------------------------
Maximum Likelihood Estimates
-----------------------------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
-----------------------------------------------------------------------------------------------
(Intercept) 1 0.3887 0.1124 3.4588 5e-04
distance_to_primary_road 1 0.0000 0.0000 -0.7153 0.4744
distance_to_secondary_road 1 0.0000 0.0000 -0.5530 0.5802
distance_to_tertiary_road 1 1e-04 0.0000 4.6708 0.0000
distance_to_city 1 0.0000 0.0000 -4.7574 0.0000
distance_to_town 1 0.0000 0.0000 -4.9170 0.0000
is_urbanTRUE 1 -0.2971 0.0819 -3.6294 3e-04
usage_capacity1000 1 -0.6230 0.0697 -8.9366 0.0000
water_source_cleanProtected Shallow Well 1 0.5040 0.0857 5.8783 0.0000
water_source_cleanProtected Spring 1 1.2882 0.4388 2.9359 0.0033
water_point_population 1 -5e-04 0.0000 -11.3686 0.0000
local_population_1km 1 3e-04 0.0000 19.2953 0.0000
-----------------------------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.7347 Somers' D 0.4693
% Discordant 0.2653 Gamma 0.4693
% Tied 0.0000 Tau-a 0.2318
Pairs 5585188 c 0.7347
---------------------------------------------------------------
Most of the features are significant predictors to use except for: distance_to_primary_road and distance_to_secondary_road.
Among the significant predictors, there is negative correlation for is_urban_true and usage_capacity1000 feature. Rest of the features have a positive correlation.
2.1 Model Evaluation
blr_confusion_matrix(model, cutoff = 0.5)Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
0 1301 738
1 813 1904
Accuracy : 0.6739
No Information Rate : 0.4445
Kappa : 0.3373
McNemars's Test P-Value : 0.0602
Sensitivity : 0.7207
Specificity : 0.6154
Pos Pred Value : 0.7008
Neg Pred Value : 0.6381
Prevalence : 0.5555
Detection Rate : 0.4003
Detection Prevalence : 0.5713
Balanced Accuracy : 0.6680
Precision : 0.7008
Recall : 0.7207
'Positive' Class : 1
Next, we evaluate the performance of the model. We set the cut off probability as 0.5. If probability outputted by LR model is greater than 0.5, it is a functional water point. Otherwise non-functional water point.
We compare the predicted values against the actual value. The accuracy of model is 67% and sensitivity/ true positive rate is 72% which is a moderate classifying performance.
2.2 Model Improvement
As we have determined the distance to primary road and distance to secondary road are insignificant features, we will remove it from our model building and rerun the summary report to see if there is an improvement.
model_cleaned <- glm(status ~ distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sf_clean,
family = binomial(link = "logit"))
blr_regress(model_cleaned) Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data status 4756 4755 4746 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 2114 1 2642
--------------------------------------------------------
Maximum Likelihood Estimates
-----------------------------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
-----------------------------------------------------------------------------------------------
(Intercept) 1 0.3540 0.1055 3.3541 8e-04
distance_to_tertiary_road 1 1e-04 0.0000 4.9096 0.0000
distance_to_city 1 0.0000 0.0000 -5.2022 0.0000
distance_to_town 1 0.0000 0.0000 -5.4660 0.0000
is_urbanTRUE 1 -0.2667 0.0747 -3.5690 4e-04
usage_capacity1000 1 -0.6206 0.0697 -8.9081 0.0000
water_source_cleanProtected Shallow Well 1 0.4947 0.0850 5.8228 0.0000
water_source_cleanProtected Spring 1 1.2790 0.4384 2.9174 0.0035
water_point_population 1 -5e-04 0.0000 -11.3902 0.0000
local_population_1km 1 3e-04 0.0000 19.4069 0.0000
-----------------------------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.7349 Somers' D 0.4697
% Discordant 0.2651 Gamma 0.4697
% Tied 0.0000 Tau-a 0.2320
Pairs 5585188 c 0.7349
---------------------------------------------------------------
blr_confusion_matrix(model_cleaned, cutoff = 0.5)Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
0 1300 743
1 814 1899
Accuracy : 0.6726
No Information Rate : 0.4445
Kappa : 0.3348
McNemars's Test P-Value : 0.0761
Sensitivity : 0.7188
Specificity : 0.6149
Pos Pred Value : 0.7000
Neg Pred Value : 0.6363
Prevalence : 0.5555
Detection Rate : 0.3993
Detection Prevalence : 0.5704
Balanced Accuracy : 0.6669
Precision : 0.7000
Recall : 0.7188
'Positive' Class : 1
Accuracy remains at 67% and senstivity/ true positive rate reamins at 72%.
How can we improve the model? We can improve the model by adding geographical weights using the GW model.
3. Model building with Geographical Weights using GWR method
GW model takes in only spatial dataframe format. Hence we need to convert it using the as_Spatial() method.
Osun_wp_sp <- Osun_wp_sf_clean %>%
select(c(status, distance_to_primary_road,
distance_to_secondary_road,
distance_to_tertiary_road,
distance_to_city,
distance_to_town,
water_point_population,
local_population_1km,
is_urban,
usage_capacity,
water_source_clean)) %>%
as_Spatial()
Osun_wp_spclass : SpatialPointsDataFrame
features : 4756
extent : 182502.4, 290751, 340054.1, 450905.3 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs
variables : 11
names : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean
min values : 0, 0.014461356813335, 0.152195902540837, 0.017815121653488, 53.0461399623541, 30.0019777713073, 0, 0, 0, 1000, Borehole
max values : 1, 26909.8616132094, 19559.4793799085, 10966.2705628969, 47934.343603562, 44020.6393368124, 29697, 36118, 1, 300, Protected Spring
Next we calculate the distance bandwidth metrics. We use the fixed distance bandwidth metrics here.
bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
distance_to_secondary_road +
distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
family = "binomial",
approach = "AIC",
kernel = "gaussian",
adaptive = FALSE,
longlat = FALSE)From above it is found bw.fixed is 2599.672.
bw.fixed <- 2599.672
bw.fixed[1] 2599.672
The fixed bandwidth distance used will be 2599.672m. We build the geographical weighted generalised model using the ggwr.fixed() method and defining the fixed bandwith distance to use.
gwlr.fixed <- ggwr.basic(status ~ distance_to_primary_road +
distance_to_secondary_road +
distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
bw = bw.fixed,
family = "binomial",
kernel = "gaussian",
adaptive = FALSE,
longlat = FALSE) Iteration Log-Likelihood
=========================
0 -1958
1 -1676
2 -1526
3 -1443
4 -1405
5 -1405
gwlr.fixed ***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2022-12-17 14:00:35
Call:
ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road +
distance_to_tertiary_road + distance_to_city + distance_to_town +
is_urban + usage_capacity + water_source_clean + water_point_population +
local_population_1km, data = Osun_wp_sp, bw = bw.fixed, family = "binomial",
kernel = "gaussian", adaptive = FALSE, longlat = FALSE)
Dependent (y) variable: status
Independent variables: distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road distance_to_city distance_to_town is_urban usage_capacity water_source_clean water_point_population local_population_1km
Number of data points: 4756
Used family: binomial
***********************************************************************
* Results of Generalized linear Regression *
***********************************************************************
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-124.555 -1.755 1.072 1.742 34.333
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Intercept 3.887e-01 1.124e-01 3.459 0.000543
distance_to_primary_road -4.642e-06 6.490e-06 -0.715 0.474422
distance_to_secondary_road -5.143e-06 9.299e-06 -0.553 0.580230
distance_to_tertiary_road 9.683e-05 2.073e-05 4.671 3.00e-06
distance_to_city -1.686e-05 3.544e-06 -4.757 1.96e-06
distance_to_town -1.480e-05 3.009e-06 -4.917 8.79e-07
is_urbanTRUE -2.971e-01 8.185e-02 -3.629 0.000284
usage_capacity1000 -6.230e-01 6.972e-02 -8.937 < 2e-16
water_source_cleanProtected Shallow Well 5.040e-01 8.574e-02 5.878 4.14e-09
water_source_cleanProtected Spring 1.288e+00 4.388e-01 2.936 0.003325
water_point_population -5.097e-04 4.484e-05 -11.369 < 2e-16
local_population_1km 3.451e-04 1.788e-05 19.295 < 2e-16
Intercept ***
distance_to_primary_road
distance_to_secondary_road
distance_to_tertiary_road ***
distance_to_city ***
distance_to_town ***
is_urbanTRUE ***
usage_capacity1000 ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring **
water_point_population ***
local_population_1km ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6534.5 on 4755 degrees of freedom
Residual deviance: 5688.0 on 4744 degrees of freedom
AIC: 5712
Number of Fisher Scoring iterations: 5
AICc: 5712.099
Pseudo R-square value: 0.1295351
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: gaussian
Fixed bandwidth: 2599.672
Regression points: the same locations as observations are used.
Distance metric: A distance matrix is specified for this model calibration.
************Summary of Generalized GWR coefficient estimates:**********
Min. 1st Qu. Median
Intercept -8.7229e+02 -4.9955e+00 1.7600e+00
distance_to_primary_road -1.9389e-02 -4.8031e-04 2.9618e-05
distance_to_secondary_road -1.5921e-02 -3.7551e-04 1.2317e-04
distance_to_tertiary_road -1.5618e-02 -4.2368e-04 7.6179e-05
distance_to_city -1.8416e-02 -5.6217e-04 -1.2726e-04
distance_to_town -2.2411e-02 -5.7283e-04 -1.5155e-04
is_urbanTRUE -1.9790e+02 -4.2908e+00 -1.6864e+00
usage_capacity1000 -2.0772e+01 -9.7231e-01 -4.1592e-01
water_source_cleanProtected.Shallow.Well -2.0789e+01 -4.5190e-01 5.3340e-01
water_source_cleanProtected.Spring -5.2235e+02 -5.5977e+00 2.5441e+00
water_point_population -5.2208e-02 -2.2767e-03 -9.8875e-04
local_population_1km -1.2698e-01 4.9952e-04 1.0638e-03
3rd Qu. Max.
Intercept 1.2763e+01 1073.2156
distance_to_primary_road 4.8443e-04 0.0142
distance_to_secondary_road 6.0692e-04 0.0258
distance_to_tertiary_road 6.6815e-04 0.0128
distance_to_city 2.3718e-04 0.0150
distance_to_town 1.9271e-04 0.0224
is_urbanTRUE 1.2841e+00 744.3099
usage_capacity1000 3.0322e-01 5.9281
water_source_cleanProtected.Shallow.Well 1.7849e+00 67.6343
water_source_cleanProtected.Spring 6.7663e+00 317.4133
water_point_population 5.0102e-04 0.1309
local_population_1km 1.8157e-03 0.0392
************************Diagnostic information*************************
Number of data points: 4756
GW Deviance: 2795.084
AIC : 4414.606
AICc : 4747.423
Pseudo R-square value: 0.5722559
***********************************************************************
Program stops at: 2022-12-17 14:01:50
Geographically weighted generalized regression works by building multiple generalized linear models for locations in same bandwidth. Each model has their own localized geographical weights.
We observe like in the global generalised linear model, the metrics distance to primary road and distance to secondary road are not significant as they have p values higher than 0.05.
From the summary report above, we can also see the model with geographical weights performs better with a lower AIC . Hence it is a better model with geographical weights.
3.1 Model Comparison
We want to compute a confusion matrix. We need to cover the SDF object into a date frame.
gwr.fixed <- as.data.frame(gwlr.fixed$SDF)We will label predicted values with probability higher than 0.5 into 1 and else 0. The result will be saved as most column.
gwr.fixed <- gwr.fixed %>%
mutate(most = ifelse(
gwr.fixed$yhat >= 0.5, T, F))There is no easy method to print out confusion report. Hence we create our own using confusion Matrix from caret library using the actual values column and gwr.fixed and the newly appended column “most”.
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data=gwr.fixed$most, reference = gwr.fixed$y)
CMConfusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1824 263
TRUE 290 2379
Accuracy : 0.8837
95% CI : (0.8743, 0.8927)
No Information Rate : 0.5555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7642
Mcnemar's Test P-Value : 0.2689
Sensitivity : 0.8628
Specificity : 0.9005
Pos Pred Value : 0.8740
Neg Pred Value : 0.8913
Prevalence : 0.4445
Detection Rate : 0.3835
Detection Prevalence : 0.4388
Balanced Accuracy : 0.8816
'Positive' Class : FALSE
By using geographical weighted model, the sensitivity has improved to 86.3% and accuracy to 88.4%.
3.2 Model Improvement
Can we improve the model if we remove the insignificant variables? We create the distance bandwidth metrics without the “distance to primary road” and “distance to secondary road” metrics.
bw.fixed_cleaned <- bw.ggwr(status ~ distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
family = "binomial",
approach = "AIC",
kernel = "gaussian",
adaptive = FALSE,
longlat = FALSE)Lets check what bandwidth we should use.
bw.fixed_cleaned <- 2377.371
bw.fixed_cleaned[1] 2377.371
The optimal fixed bandwith distance to use it 2377.371m. Next we build the model using the distance bandwidth matrix built earlier.
gwlr.fixed_cleaned <- ggwr.basic(status ~distance_to_tertiary_road +
distance_to_city +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
bw = bw.fixed_cleaned,
family = "binomial",
kernel = "gaussian",
adaptive = FALSE,
longlat = FALSE) Iteration Log-Likelihood
=========================
0 -1959
1 -1680
2 -1531
3 -1447
4 -1413
5 -1413
Next we check the model evaluation metrics
gwr.fixed_cleaned <- as.data.frame(gwlr.fixed_cleaned$SDF)
gwr.fixed_cleaned <- gwr.fixed_cleaned %>%
mutate(most = ifelse(
gwr.fixed_cleaned$yhat >= 0.5, T, F))
gwr.fixed_cleaned$y <- as.factor(gwr.fixed_cleaned$y)
gwr.fixed_cleaned$most <- as.factor(gwr.fixed_cleaned$most)
CM_cleaned <- confusionMatrix(data=gwr.fixed_cleaned$most, reference = gwr.fixed_cleaned$y)
CM_cleanedConfusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1833 268
TRUE 281 2374
Accuracy : 0.8846
95% CI : (0.8751, 0.8935)
No Information Rate : 0.5555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7661
Mcnemar's Test P-Value : 0.6085
Sensitivity : 0.8671
Specificity : 0.8986
Pos Pred Value : 0.8724
Neg Pred Value : 0.8942
Prevalence : 0.4445
Detection Rate : 0.3854
Detection Prevalence : 0.4418
Balanced Accuracy : 0.8828
'Positive' Class : FALSE
By removing the insignificant variables using geographical weighted model, the sensitivity has improved to 86.7% from 86.3% while accuracy remains the same at 88.4%.
4. Visualization
Next we visualize the Local government areas water point functionality prediction. We set tm_dots to display only 2 categories - below 0.5 and above 0.5
Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
select(c(ADM2_EN, ADM2_PCODE,ADM1_EN, ADM1_PCODE, status))gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed_cleaned)tmap_mode("view")
prob_t <- tm_shape(Osun) +
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = "yhat",
border.col = "gray60",
border.lwd = 1,
n=2) +
tm_view(set.zoom.limits = c(8,14))
prob_tWe also plot the original graph with original values.
og_y <- tm_shape(Osun) +
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = "y",
border.col = "gray60",
border.lwd = 1,
n=2,
palette = "Blues") +
tm_view(set.zoom.limits = c(8,14))
og_yWe put them side by side. As we can see there are more misclassifications in the North West region. The predicted values predicted more TRUE (False Positives) than Actual TRUE.
4.1 Tuning Threshold
One way to address it is to increase the threshold. This reduces the number of False Positive but increases the number of False Negative.
In this context, a model that predicts less False positive is preferred because it is a more conservative approach to monitor water point functionality. It will be worse off if taps that are not working are marked as working thus shortchaning it of timely repairs which in return deprives villagers from access to water.
tmap_arrange(prob_t,og_y)We increase threshold to 0.6.
gwr.fixed_cleaned <- as.data.frame(gwlr.fixed_cleaned$SDF)
gwr.fixed_cleaned <- gwr.fixed_cleaned %>%
mutate(most = ifelse(
gwr.fixed_cleaned$yhat >= 0.6, T, F))
gwr.fixed_cleaned$y <- as.factor(gwr.fixed_cleaned$y)
gwr.fixed_cleaned$most <- as.factor(gwr.fixed_cleaned$most)
CM_cleaned <- confusionMatrix(data=gwr.fixed_cleaned$most, reference = gwr.fixed_cleaned$y)
CM_cleanedConfusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1976 472
TRUE 138 2170
Accuracy : 0.8717
95% CI : (0.8619, 0.8811)
No Information Rate : 0.5555
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7443
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9347
Specificity : 0.8213
Pos Pred Value : 0.8072
Neg Pred Value : 0.9402
Prevalence : 0.4445
Detection Rate : 0.4155
Detection Prevalence : 0.5147
Balanced Accuracy : 0.8780
'Positive' Class : FALSE
gwr_sf.fixed2 <- cbind(Osun_wp_sf_selected, gwr.fixed_cleaned)
prob_t_2 <- tm_shape(Osun) +
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed2) +
tm_dots(col = "yhat",
border.col = "gray60",
border.lwd = 1,
breaks = c(0,0.6,1)) +
tm_view(set.zoom.limits = c(8,14))
og_y_2 <- tm_shape(Osun) +
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed2) +
tm_dots(col = "y",
border.col = "gray60",
border.lwd = 1,
palette = "Blues") +
tm_view(set.zoom.limits = c(8,14))Increasing threshold has improved the sensitivity. Now we visualize it in a graph. The 2 graphs look more similar.
tmap_arrange(prob_t_2,og_y_2)4.2 Visualising Metrics
We can also visualize the metrics of the variables. gwr_sf.fixed contains both original values (ends with SE) and t statistics value (TV).
tertiary_TV <- tm_shape(Osun) +
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = "distance_to_tertiary_road_TV", border.col = "gray60", border.lwd=4, palette = "Purples") +
tm_view(set.zoom.limits = c(8,14))
tertiary_SE <- tm_shape(Osun) +
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = "distance_to_tertiary_road_SE", border.col = "gray60", border.lwd=4) +
tm_view(set.zoom.limits = c(8,14))
tmap_arrange(tertiary_SE, tertiary_TV, asp=1, ncol=2, sync = TRUE)town_TV <- tm_shape(Osun) +
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = "distance_to_town_TV", border.col = "gray60", border.lwd=4, palette = "Purples") +
tm_view(set.zoom.limits = c(8,14))
town_SE <- tm_shape(Osun) +
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = "distance_to_town_SE", border.col = "gray60", border.lwd=4) +
tm_view(set.zoom.limits = c(8,14))
tmap_arrange(town_SE, town_TV, asp=1, ncol=2, sync = TRUE)